Closes #169 #405
f"QATestSetMedQrels_judged_answers": f"{_DATA_PATH}/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip",
}

_SUPPORTED_TASKS = [Tasks.QUESTION_ANSWERING]  # TODO: shall we add a non-existing task type such as `RQE`?
The issue description says the dataset supports QA and RQE. Is it enough to set _SUPPORTED_TASKS this way?
Mmm... I think we could get away with using constants.Tasks.TEXTUAL_ENTAILMENT
Quoting from the README:
We included additional annotations in the XML files, that could be used for diverse IR and NLP tasks, such as the question type, the question focus, its synonyms, its UMLS Concept Unique Identifier (CUI) and Semantic Type.
So it seems it might be possible to add NAMED_ENTITY_RECOGNITION / NAMED_ENTITY_DISAMBIGUATION to this, but I'm not sure...
The data provided in the XML files doesn't seem to be structured for NER, e.g. this sample.
I'll take another look to see whether I can parse them.

_HOMEPAGE = "https://github.com/abachaa/MedQuAD"

_LICENSE = "https://creativecommons.org/licenses/by/4.0/legalcode"  # TODO: terms aren't available in the repository! In the issue it is 'CC BY 4.0'
Also, the description says the license type is CC BY 4.0, so I searched for the license terms online. I just wanted to make sure this is correct.
Where did you find the license of the dataset? I cannot seem to find it...
This PR closes #169
@giyaseddin Thank you very much for your contribution! Oh my, this seems to be a nasty one... Could you please check my comments?
I would like to help you out more, but first we should get the dataloader to download the data in a reasonable amount of time/steps and see what we've got. Thanks!

qa_pairs_enriched_fpath = self._dump_xml_to_json(dl_manager)

# There is no canonical train/valid/test set in this dataset. So, only TRAIN is added.
I think they use this for testing: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017/tree/master/TestDataset
I added this set, but its general structure doesn't match the other files, so I implemented it to just barely fit the schema.
Thank you for checking out my comments @giyaseddin! I am trying to inspect the dataset with:
[ins] In [7]: from datasets import load_dataset
              ds = load_dataset("biodatasets/medquad/medquad.py", "medquad_source")
But I get this error:
    142     raise NotImplementedError("Only `source` and `bigbio_qa` schemas are implemented.")
    144 return datasets.DatasetInfo(
    145     description=_DESCRIPTION,
    146     features=features,
    (...)
    149     citation=_CITATION,
    150 )
--> 152 def _load_qa_from_xml(self, file_paths) -> List[dict[str, str | None]]:
    153     """
    154     This method traverses the whole list of the downloaded XML files and extracts Q&A pairs.
    155     Returns the extracted Q&As and the base directory of the dumped json file that contains them all.
    156     """
    157     assert len(file_paths)
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
Could you please make sure we can load both
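For context, this TypeError is what Python versions before 3.10 raise when an annotation such as `str | None` is evaluated at function definition time, so the interpreter version is the likely cause here. Below is a minimal sketch of a portable alternative using `typing.Optional`. The function name, the sample XML, and its tag names are hypothetical and only illustrate the annotation style; they are not the PR's actual code or the dataset's confirmed layout.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical XML layout, for illustration only.
SAMPLE_XML = """
<Document>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="q1">What is glaucoma?</Question>
      <Answer>Glaucoma is a group of eye diseases.</Answer>
    </QAPair>
    <QAPair pid="2">
      <Question qid="q2">Who is at risk?</Question>
      <Answer></Answer>
    </QAPair>
  </QAPairs>
</Document>
"""


def load_qa_from_xml_string(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Extract question/answer pairs; a missing or empty element yields None."""
    root = ET.fromstring(xml_text.strip())
    pairs: List[Dict[str, Optional[str]]] = []
    for pair in root.iter("QAPair"):
        question = pair.find("Question")
        answer = pair.find("Answer")
        pairs.append(
            {
                # Element.text is None for empty elements such as <Answer></Answer>.
                "question": question.text if question is not None else None,
                "answer": answer.text if answer is not None else None,
            }
        )
    return pairs
```

Another option would be adding `from __future__ import annotations` at the top of the module, which defers annotation evaluation and lets the `str | None` spelling stay as written.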
Hey @giyaseddin! Do you plan to keep working on this?
Hey @regel-corpus,
Could you please check whether the current version downloads correctly, @sg-wbi?
Hi @giyaseddin, I pulled the latest code, and it seems like this error still occurs upon loading. Could you check again whether you have fixed it in your updates?
Hi @giyaseddin, thanks for putting in the effort to continue working on this dataset. Would it be possible to pull the up-to-date master into your branch? There are some inconsistencies between your branch and master, which block running the unit tests. Thanks!
Checkbox
biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
_CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
_info(), _split_generators() and _generate_examples() in dataloader script.
BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
datasets.load_dataset function.
python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
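As a quick local pre-check before running the test suite, the variable names in the checklist above could be verified with a small helper. This is only a sketch: `missing_required_variables` is a hypothetical function, not part of the project or its tests, and it checks only that the names exist, not that their values are sensible.

```python
from typing import Dict, List

# Module-level names listed in the checklist above.
REQUIRED_VARIABLES = (
    "_CITATION",
    "_DATASETNAME",
    "_DESCRIPTION",
    "_HOMEPAGE",
    "_LICENSE",
    "_URLs",
    "_SUPPORTED_TASKS",
    "_SOURCE_VERSION",
    "_BIGBIO_VERSION",
)


def missing_required_variables(namespace: Dict[str, object]) -> List[str]:
    """Return the checklist names that are absent from a module namespace."""
    return [name for name in REQUIRED_VARIABLES if name not in namespace]
```

It could be called with `vars(module)` for an imported dataloader module; an empty list means every name in the checklist is defined.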